Skip to content

fix: make the dedup operator cover all column types#80

Merged
liulx20 merged 5 commits intoalibaba:mainfrom
liulx20:dedup
Mar 19, 2026
Merged

fix: make the dedup operator cover all column types#80
liulx20 merged 5 commits intoalibaba:mainfrom
liulx20:dedup

Conversation

@liulx20
Copy link
Collaborator

@liulx20 liulx20 commented Mar 18, 2026

Fixes #82

Greptile Summary

This PR extends the DEDUP operator to cover all column types by changing generate_dedup_offset from a void method (that terminated with LOG(FATAL) for unsupported types) to a bool method that returns false to signal "use the generic hash-based fallback." It also adds the previously-missing MSVertexColumn::generate_dedup_offset implementation and removes the row_num parameter from ColumnsUtils::generate_dedup_offset.

Key changes:

  • generate_dedup_offset now returns bool; false means "fall back to hash-based dedup" in dedup.cc
  • MSVertexColumn gains a correct generate_dedup_offset with a null_seen guard
  • ArrowArrayContextColumn's entire type-dispatched dedup implementation is removed; it now silently falls back through the base-class default, which emits LOG(ERROR) on every dedup — this will pollute production logs with false error messages for a valid code path
  • LOG(FATAL)LOG(ERROR) across MSEdgeColumn, ListColumn, StructColumn, and the base class allows graceful fallback, but LOG(ERROR) is still too severe for an expected, handled code path; LOG(WARNING) or lower would be more appropriate
  • The ColumnsUtils::generate_dedup_offset helper has a harmless but redundant resize call after the constructor already sets the correct size
  • The empty if body in dedup.cc (line 37–38) is valid C++ but makes the intent hard to read at a glance

Confidence Score: 3/5

  • Dedup is functionally correct for all column types but the removal of ArrowArrayContextColumn's fast path introduces spurious LOG(ERROR) noise on valid operations, and LOG(ERROR) across other fallback types is misleading throughout.
  • The core logic is sound — the bool-returning generate_dedup_offset pattern works correctly, and MSVertexColumn's new implementation properly handles nulls. However, removing ArrowArrayContextColumn's dedicated implementation without suppressing the base-class LOG(ERROR) will pollute production logs on every Arrow-column dedup. Additionally, LOG(ERROR) is used where a warning or debug log would be appropriate for an expected, handled code path, making genuine errors harder to distinguish in logs.
  • src/execution/common/columns/arrow_context_column.cc (removed fast-path, now logs ERROR on valid ops), include/neug/execution/common/columns/i_context_column.h and all overrides using LOG(ERROR) for graceful fallback paths

Important Files Changed

Filename Overview
src/execution/common/operators/retrieve/dedup.cc Refactors dedup to use a boolean return value pattern — single-column fast path tries generate_dedup_offset and falls back to hash-based dedup on false; has an empty if-body style issue and the fallback silently succeeds after logging a spurious error for Arrow columns.
include/neug/execution/common/columns/columns_utils.h Removes the now-redundant row_num parameter; has a harmless but redundant second resize call after the constructor already sets the correct size.
src/execution/common/columns/arrow_context_column.cc Removes the entire Arrow-type-dispatched generate_dedup_offset implementation; now falls back to the base-class which logs LOG(ERROR) and returns false, triggering the hash-based fallback — correct result but spurious error log on every ArrowArrayContextColumn dedup.
include/neug/execution/common/columns/i_context_column.h Base generate_dedup_offset changed from void/LOG(FATAL) to bool/LOG(ERROR)/return false, enabling graceful fallback; LOG(ERROR) still fires even when caller handles the false return correctly.
src/execution/common/columns/vertex_columns.cc Adds MSVertexColumn::generate_dedup_offset with correct null handling (tracks null_seen flag) and set-based dedup; updates return type to bool for all vertex column implementations.
include/neug/execution/common/columns/edge_columns.h Updates generate_dedup_offset return type to bool; MSEdgeColumn changed from LOG(FATAL) to LOG(ERROR)/return false to enable graceful fallback.
include/neug/execution/common/columns/list_columns.h ListColumn::generate_dedup_offset changed from LOG(FATAL) to LOG(ERROR)/return false; falls back to hash-based dedup in caller.
src/execution/common/columns/struct_columns.cc StructColumn::generate_dedup_offset changed from LOG(FATAL) to LOG(ERROR)/return false; falls back to hash-based dedup in caller.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Dedup called] --> B{cols empty?}
    B -- yes --> C[Return ctx unchanged]
    B -- no --> D{Single column?}
    D -- yes --> E[Call generate_dedup_offset]
    E --> F{Returns true?}
    F -- yes --> G[Fast path: offsets populated]
    F -- "no - LOG ERROR fired" --> H[Clear offsets, use hash fallback]
    D -- "no, multi-column" --> H
    H --> I[Hash fallback: encode each row via get_elem]
    I --> J[offsets = unique row indices]
    G --> K[Build result Context]
    J --> K
    K --> L[reshuffle offsets]
    L --> M[Return ret]

    subgraph FastPathTypes["Fast-path column types"]
        FP1[SLVertexColumn - bitset]
        FP2[MSVertexColumn - set plus null flag - NEW]
        FP3[MLVertexColumn - set]
        FP4[PathColumn - sort dedup vector]
        FP5[ValueColumn - sort or set]
        FP6[SDSLEdgeColumn, BDSLEdgeColumn, SDMLEdgeColumn, BDMLEdgeColumn]
    end

    subgraph FallbackTypes["No fast path - hash fallback"]
        FB1[ArrowArrayContextColumn - base class LOG ERROR]
        FB2[MSEdgeColumn - LOG ERROR]
        FB3[ListColumn - LOG ERROR]
        FB4[StructColumn - LOG ERROR]
    end
Loading

Comments Outside Diff (1)

  1. src/execution/common/operators/retrieve/dedup.cc, line 37-54 (link)

    P2 Fragile fallback when generate_dedup_offset fails mid-way

    The new control flow silently falls through to the slow-path else branch whenever generate_dedup_offset returns false. If a future generate_dedup_offset implementation partially populates offsets before returning false (e.g., due to an error mid-loop), the else branch would append additional entries onto the already-partially-populated vector, producing a corrupted, non-deduplicated result.

    All current false-returning implementations happen to leave offsets untouched, so this is not a bug today. But the contract is nowhere documented: there is no offsets.clear() guard at the top of the else branch, and no documented requirement that callers of generate_dedup_offset leave offsets unchanged on failure.

    Consider adding a defensive offsets.clear() at the start of the else branch, or documenting the contract that generate_dedup_offset must not modify offsets on failure:

  2. src/execution/common/operators/retrieve/dedup.cc, line 37-54 (link)

    P1 Missing offsets.clear() in fallback branch

    When cols.size() == 1 and generate_dedup_offset returns false, the code falls into the else branch. Currently all false-returning implementations leave offsets untouched (e.g. MSEdgeColumn, ListColumn, StructColumn), so the else block sees an empty vector and fills it correctly. However, the else block never explicitly clears offsets before appending to it.

    If any future generate_dedup_offset implementation partially populates offsets before discovering a failure and returning false (which is perfectly reasonable), the stale entries would be mixed with the fresh ones produced by the else block, causing duplicate or incorrect result rows.

    The fix is to add offsets.clear() at the start of the else block:

  3. include/neug/execution/common/columns/columns_utils.h, line 43 (link)

    P1 Out-of-bounds access when row_num == 0

    offsets.push_back(row_indices[0]) is called unconditionally. If row_num is zero, row_indices is empty and row_indices[0] is undefined behaviour. Although this is pre-existing code, this PR adds several new callers that now return true and reach this path (PathColumn, BDSLEdgeColumn, SDMLEdgeColumn, BDMLEdgeColumn), widening the exposure. A guard is needed:

Last reviewed commit: "fix"

Greptile also left 2 inline comments on this PR.

@liulx20
Copy link
Collaborator Author

liulx20 commented Mar 18, 2026

@greptile

@liulx20
Copy link
Collaborator Author

liulx20 commented Mar 18, 2026

@greptile

@liulx20 liulx20 requested a review from shirly121 March 19, 2026 02:17
@liulx20 liulx20 merged commit c5d5d79 into alibaba:main Mar 19, 2026
16 checks passed
liulx20 added a commit to liulx20/neug that referenced this pull request Mar 20, 2026
* make dedup operator cover all column types

* format

* fix
longbinlai added a commit that referenced this pull request Mar 20, 2026
* add java sdk

* add test cases

* Update tools/java_driver/USAGE.md

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update tools/java_driver/USAGE.md

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix some issues

* add ClientTest

* update doc

* fix doc

* Update tools/java_driver/pom.xml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tools/java_driver/src/test/java/org/alibaba/neug/driver/InternalResultSetTest.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* format

* rename org to com

* fix doc

* add result metadata

* fix

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* add tests

* add doc

* add maven

* Update tools/java_driver/src/main/java/com/alibaba/neug/driver/utils/Client.java

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update tools/java_driver/src/main/java/com/alibaba/neug/driver/internal/InternalResultSet.java

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* add e2e ci

* add param test

* format

* Update tools/java_driver/src/main/java/com/alibaba/neug/driver/internal/InternalSession.java

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update InternalSession.java

* remove pb generated

* fix doc

* fix doc

* fix doc

* fix workflows

* fix version

* fix generator

* fix maven action

* fix: catch OSError in neug-cli readline history loading on macOS (#75)

* fix: catch OSError in neug-cli readline history loading on macOS

On macOS, Python's readline module is backed by libedit instead of GNU
readline. When ~/.neug_history was written by a GNU readline session
(e.g. from Docker/Linux), libedit raises OSError (errno 22 EINVAL)
instead of silently handling the incompatible format.

The original code only caught FileNotFoundError, causing neug-cli to
crash on startup. Broaden the exception handler to also catch OSError so
the history file is simply skipped, matching the intended behavior.

Fixes #74

* fix: scope OSError catch to errno.EINVAL for libedit incompatibility

Per greptile review: catching the full OSError base class could silently
swallow unrelated errors such as PermissionError or IsADirectoryError.
Narrow the catch to only suppress errno.EINVAL (22), which is the specific
error raised by macOS libedit when it encounters a GNU readline history
file. All other OSError variants are re-raised so users see genuine
problems.

Also add 'import errno' to top-level imports.

* Update tools/java_driver/src/main/java/com/alibaba/neug/driver/internal/InternalDriver.java

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix getBigDecimal

* Update tools/java_driver/src/main/java/com/alibaba/neug/driver/internal/InternalResultSet.java

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update tools/java_driver/src/main/java/com/alibaba/neug/driver/internal/InternalResultSet.java

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix getObject

* feat: Support Export Query Results to JSON/JSONL file (#60)

* support export arrow table to csv format

Committed-by: Xiaoli Zhou from Dev container

* export query response PB to csv format

Committed-by: Xiaoli Zhou from Dev container

* minor fix according to review

Committed-by: Xiaoli Zhou from Dev container

* fix according to review

Committed-by: Xiaoli Zhou from Dev container

* minor fix

Committed-by: Xiaoli Zhou from Dev container

* support export query results to json format

Committed-by: Xiaoli Zhou from Dev container

* minor fix

Committed-by: Xiaoli Zhou from Dev container

* remove 'newline_delimited' settings and detect jsonl format from path

Committed-by: Xiaoli Zhou from Dev container

Committed-by: Xiaoli Zhou from Dev container

Committed-by: Xiaoli Zhou from Dev container

Committed-by: Xiaoli Zhou from Dev container

* minor fix

Committed-by: Xiaoli Zhou from Dev container

* add export to json tests in CI

Committed-by: Xiaoli Zhou from Dev container

Committed-by: Xiaoli Zhou from Dev container

* Update extension/json/src/json_export_function.cc

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update extension/json/src/json_export_function.cc

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update extension/json/src/json_export_function.cc

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* minor fix

Committed-by: Xiaoli Zhou from Dev container

* minor fix

Committed-by: Xiaoli Zhou from Dev container

* refine extension tests anotation

Committed-by: Xiaoli Zhou from Dev container

* minor fix

Committed-by: Xiaoli Zhou from Dev container

* rename INSTALL_EXTENSIONS to CI_INSTALL_EXTENSIONS to avoid conflict

Committed-by: Xiaoli Zhou from Dev container

* refine json extension tests ci

Committed-by: Xiaoli Zhou from Dev container

* minor fix

Committed-by: Xiaoli Zhou from Dev container

Committed-by: Xiaoli Zhou from Dev container

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* remove bytearray

* add codegraph-qa skill (#78)

* fix: Fix default value support for all type of properties (#63)

Refactor the default value support for storage, avoid exposing default_value on column and mmap_array


---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix: Fix incorrect edge table state when transforming between bundled and unbundled (#28)

Fix incorrect edge table state when transforming between bundled and unbundled, include special case for string properties

* fix: make the dedup operator cover all column types (#80)

* make dedup operator cover all column types

* format

* fix

* Correct the is_optional interface behavior for certain columns (#90)

* add a codegraph example (#87)

Co-authored-by: Longbin Lai <longbin.lai@gmail.com>

* add checkRowIndex

* add update_was_null

* update doc

* fix

* update doc

* fix

* Implement the iteration method for QueryResult

* update query_result.md

* update

* update doc

* format example

* format

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Longbin Lai <longbin.lai@gmail.com>
Co-authored-by: Xiaoli Zhou <yihe.zxl@alibaba-inc.com>
Co-authored-by: BingqingLyu <bingqing.lbq@alibaba-inc.com>
Co-authored-by: Zhang Lei <xiaolei.zl@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Engine error: MSVertexColumn not implemented for dedup operator

2 participants